Discovering Missing Values in Semi-Structured Databases

Authors

  • Xing Yi
  • James Allan
  • Victor Lavrenko
Abstract

We explore the problem of discovering multiple missing values in a semi-structured database. For this task, we formally develop the Structured Relevance Model (SRM), built on a hypothetical generative model for semi-structured records. SRM is based on the idea that plausible values for a given field can be inferred from the context provided by the other fields in the record. Small-scale experiments on IMDb (Internet Movie Database) show that SRM matched three state-of-the-art relational learning approaches on the movie label prediction tasks. Large-scale experiments on a snapshot of the National Science Digital Library (NSDL) repository show that, compared with state-of-the-art machine learning approaches, SRM is highly effective at discovering possible values for free-text fields even with quite modest amounts of training data.

Similar resources

Efficient Frequent Pattern Mining Techniques of Semi Structured data: a Survey

Semi-structured data are huge, complex, and heterogeneous data sets. Such models capture data that are not intentionally structured, but are structured heterogeneously. These databases evolve quickly; examples include run-time reports generated by ERPs, the World-Wide Web with its HTML pages, text files, bibliographies, and various generated logs. These huge and varied become difficult to retrieve r...

Data Integration: An Alternative Perspective

Given the rapid growth of the Internet and other on-line information repositories, it is increasingly important to integrate a wider variety of data formats and data found on the Web. In this paper, we offer an alternative perspective on the integration of data from multiple sources such as traditional databases and semi-structured data sources on the Web. We observe that a lot of work on the inte...

Discovering Association Rules in Semi-structured Data Sets

The discovery of association rules is one of the classic problems of data mining. Typically, it is done over well-structured data, such as databases. In this paper, we present a method for discovering association rules in semi-structured data, namely, in a set of conceptual graphs. The method is based on conceptual clustering of the data and construction of a conceptual hierarchy. A feature of ...

Survey on Mining in Semi-Structured Data

Emerging technologies for semi-structured data have attracted wide attention in networks, e-commerce, information retrieval, and databases. In these applications, the data are modeled not as static collections but as transient data streams, where the data source is an unbounded stream of individual data items. It is becoming increasingly popular to send heterogeneous and ill-structured data throu...

Advances in Holistic Ontology Alignment

The development of the semantic Web has given birth to a large number of data sources (ontologies) with independent data and schemas. These ontologies are usually generated from existing relational databases or extracted from semi-structured data. Ontology alignment is a technique to automatically integrate them by discovering overlap in their instances and similarities in their schemas. Such i...

Publication date: 2007